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ABSTRACT 

Detecting  and  tracking  face  regions  in  image  sequences  has 
applications  to  important  problems  such  as  face  recognition, 
human-computer  interaction,  and  video  surveillance.  Visi¬ 
ble  sensors  have  inherent  limitations  in  solving  this  task, 
such  as  the  need  for  sufficient  and  specific  lighting  condi¬ 
tions,  as  well  as  sensitivity  to  variations  in  skin  color.  Ther¬ 
mal  infrared  (IR)  imaging  sensors  image  emitted  light,  not 
reflected  light,  and  therefore  do  not  have  these  limitations, 
providing  a  24-hour,  365-day  capability  while  also  being 
more  robust  to  variations  in  the  appearance  of  individuals. 

In  this  paper,  we  present  a  system  for  tracking  human 
heads  that  has  three  components.  First,  a  method  for  mod¬ 
eling  thermal  emission  from  human  skin  that  can  be  used 
for  the  purpose  of  segmenting  and  detecting  faces  and  other 
exposed  skin  regions  in  IR  imagery.  Second,  the  segmen¬ 
tation  model  is  applied  to  the  CONDENSATION  algorithm 
for  tracking  the  head  regions  over  time.  This  includes  a  new 
observation  density  that  is  motivated  by  the  segmentation 
results.  Finally,  we  examine  how  to  use  the  tracking  results 
to  refine  the  segmentation  estimate. 

1.  INTRODUCTION 

While  much  work  has  been  done  on  detecting  [1,  2,  3]  and 
tracking  [4,  5]  humans  in  image  sequences,  most  of  the  ef¬ 
fort  has  been  with  visible  sensors.  Here  we  examine  the 
problem  from  the  point  of  view  of  a  thermal  IR  sensor.  In 
particular  we  examine  both  mid-wave  (MWIR)  and  long¬ 
wave  (LWIR)  sensors. 

There  are  several  advantages  of  using  IR  over  traditional 
visible  wavelength  sensors.  These  advantages  arise  from  the 
fact  that  most  light  in  the  mid-to-long  wave  IR  is  emitted 
rather  than  reflected.  [6]  This  leads  first  to  a  24-hour,  365- 
day  capability  since  the  scene  will  not  be  dependent  on,  and 
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will  be  nearly  invariant  to,  external  lighting  sources  such  as 
the  sun  or  man-made  lights. 

In  addition  to  the  lighting  invariance,  another  aspect  that 
is  of  particular  importance  to  detecting  and  tracking  faces  is 
the  relative  uniformity  of  emissivity  values  of  skin  among 
different  members  of  the  population.  This  contrasts  to  work 
done  in  the  visible  spectrum,  where  albedo  can  vary  signifi¬ 
cantly  from  person  to  person.  While  work  has  been  done  on 
finding  invariants  related  to  skin  color  in  the  visible  spec¬ 
trum  [7],  this  problem  can  be  much  more  easily  solved  with 
a  calibrated  IR  sensor.  More  information  on  calibrated  IR 
can  be  found  in  Section  2. 

Given  calibrated  IR  as  its  input,  we  present  in  Section  3 
a  model  of  skin  imagery  that  can  be  used  to  segment  im¬ 
ages  into  three  classes:  skin,  covered  skin  (as  by  clothes 
or  hair),  and  background,  which  is  anything  else.  We  work 
with  indoor  environments,  but  allow  for  warm  items  such  as 
computers  in  the  background,  without  significant  distraction 
from  clutter. 

Such  an  ability  to  differentiate  between  skin  and  back¬ 
ground  pixels  can  obviously  be  of  great  use  when  trying  to 
track  face  regions.  Many  tracking  algorithms  using  visible 
spectrum  sensors  use  background  subtraction  [2]  to  serve 
this  function.  [5]  Such  background  subtraction  typically  re¬ 
quires  a  fixed  camera  mount,  or  limited  camera  motion. 
While  it  is  possible  to  use  motion  to  perform  such  segmen¬ 
tation  through  3D  reconstruction  [8],  the  relatively  simple 
pixel  process  that  can  be  used  in  conjunction  with  thermal 
IR  sensors  has  significant  advantages  in  computational  cost, 
robustness,  and  the  ability  to  detect  stationary  targets. 

With  the  segmentation,  we  proceed  in  Section  4  to  per¬ 
form  tracking.  We  use  the  segmentation  as  a  feature  set  to 
track  with,  which  has  several  advantages  of  the  more  tradi¬ 
tional  approach  of  using  edge  features  and  contours.  First, 
since  the  quality  of  the  skin  segmentation  is  sufficient  for 
head  detection,  the  tracker  can  self-initialize  reliably.  The 
second  advantage  of  using  the  higher-level  features  is  that 
the  tracker  is  less  prone  to  distractors.  This  is  because  most 
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Fig.  1.  Uncalibrated  (left)  and  calibrated  LWIR  images. 
There  is  significant  pixel-wise  non-uniformity  in  responsiv- 
ity  which  is  removed  by  the  calibration  process. 


Fig.  2.  The  Planck  curve  for  a  black-body  at  303K  (roughly 
skin  temperature),  with  the  area  to  be  integrated  for  an  8  — 
12/im  sensor  shaded. 


of  the  background  is  classified  as  such.  While  some  regions 
may  be  miss-classified,  their  extent  tends  to  be  minimal. 

Just  as  the  results  of  the  segmentation  can  be  useful  for 
tracking,  the  results  of  tracking  can  aid  in  the  segmenta¬ 
tion.  Since  the  segmentation  algorithm  is  trained  on  a  fixed 
population,  and  environmental  conditions  and  variations  in 
the  population  can  cause  variations  in  the  observed  intensi¬ 
ties  of  skin,  we  refine  the  skin-model  estimate  using  data 
collected  from  the  tracker.  This  is  described  in  detail  in 
Section  5.  Finally,  we  present  results  and  conclusions  in 
Sections  6  and  7. 

2.  FROM  THERMAL  VIDEO  TO  PHYSICAL 
MEASUREMENTS 

In  order  to  perform  proper  analysis,  it  is  necessary  that  ther¬ 
mal  IR  imagery  be  radiometrically  calibrated.  Radiometric 
calibration  achieves  a  direct  relationship  between  the  gray- 
value  response  at  a  pixel  and  the  absolute  amount  of  thermal 
emission  from  the  corresponding  scene  element.  This  rela¬ 
tionship  is  called  responsivity.  Thermal  emission  is  mea¬ 
sured  as  flux  in  units  of  power  such  as  Wj cm2.  The  gray- 
value  response  of  thermal  IR  pixels  for  LWIR  and  MWIR 
cameras  is  linear  with  respect  to  the  amount  of  incident  ther¬ 
mal  radiation.  The  slope  of  this  responsivity  line  is  called 
the  gain  and  the  ^/-intercept  is  the  offset.  The  gain  and  offset 
for  each  pixel  on  a  thermal  IR  focal  plane  array  is  signifi¬ 
cantly  variable  across  the  array.  That  is,  the  linear  relation¬ 
ship  can  be,  and  usually  is,  significantly  different  from  pixel 
to  pixel.  This  is  illustrated  in  Figure  1  where  both  calibrated 
and  uncalibrated  images  are  shown  of  the  same  subject. 

While  radiometric  calibration  provides  non-uniformity 
correction,  it  also  provides  the  further  advantage  of  data 
where  environmental  factors  contribute  to  a  much  lesser  de¬ 


gree.  This  is  due  to  the  relationship  back  to  a  physical  pa¬ 
rameter  of  the  imaged  object,  its  emissivity. 

Since  the  responsivity  of  LWIR/MWIR  sensors  is  very 
linear,  the  pixelwise  linear  relation  between  gray  values  and 
flux  can  be  computed  by  a  process  of  two-point  calibra¬ 
tion.  Images  of  a  black-body  radiator  covering  the  entire 
field  of  view  are  taken  at  two  known  temperatures,  and  thus 
the  gains  and  offsets  are  computed  using  the  radiant  flux  for 
a  black-body  at  a  given  temperature. 

Note  that  this  is  only  possible  if  the  emissivity  curve  of 
a  black-body  as  a  function  of  temperature  is  known.  This  is 
given  by  Planck’s  Law,  which  states  that  the  flux  emitted  at 
the  wavelength  A  by  a  blackbody  at  a  given  temperature  T 
in  Wj  (cm2 urn)  is  given  by 


W(X,T) 


2iihc2 

A5  —  1^ 


(1) 


where  h  is  Planck’s  constant,  k  is  Boltzmann’s  constant,  and 
c  is  the  speed  of  light  in  a  vacuum.  To  relate  this  to  the  flux 
observed  by  the  sensor,  the  responsivity,  R(  A)  of  the  sensor 
must  be  taken  into  account.  This  allows  the  flux  observed  by 
a  specific  sensor  from  a  black-body  at  a  given  temperature 
to  be  determined: 


W(T)  =  j W(\,T)R(\)d\  .  (2) 

For  our  sensors,  the  responsivity  is  very  flat  between  8  and 
12  (3  to  5  respectively)  microns,  so  we  can  simply  integrate 
Equation  1  for  A  between  8  and  12.  The  Planck  curve  and 
the  integration  process  are  illustrated  in  Figure  2. 

One  can  achieve  marginally  higher  precision  by  taking 
measurements  at  multiple  temperatures  and  obtaining  the 
gains  and  offsets  by  least  squares  regression.  For  the  case 


of  indoor  thermal  images  containing  human  faces,  we  take 
each  of  the  two  fixed  temperatures  to  be  below  room  tem¬ 
perature  and  above  skin  temperature,  to  obtain  the  highest 
quality  calibration  for  all  levels  of  IR  emission. 

It  should  be  noted  that  calibration  has  a  limited  life  span. 
If  a  LWIR/MWIR  camera  is  radiometrically  calibrated  in¬ 
doors,  taking  it  outdoors  where  there  is  a  significant  ambi¬ 
ent  temperature  difference  will  cause  the  gain  and  offset  of 
linear  responsivity  of  the  focal  plane  array  pixels  to  change. 
Therefore,  radiometric  calibration  must  be  performed  again. 
This  effect  is  mostly  due  to  the  optics  and  focal  plane  ar¬ 
ray  (FPA)  heating  up,  and  causing  the  sensor  to  ‘see’  more 
energy  as  a  result.  Also,  suppose  two  sequences  are  col¬ 
lected  with  different  LWIR/MWIR  cameras  but  with  the  ex¬ 
act  same  model  number,  identical  camera  settings  and  under 
the  exact  same  environmental  conditions.  Nonetheless,  no 
two  thermal  IR  focal  plane  arrays  are  ever  identical  and  the 
gain  and  offset  of  corresponding  pixels  between  these  sepa¬ 
rate  cameras  will  be  different.  Radiometric  calibration  stan¬ 
dardizes  all  thermal  IR  data  collections,  whether  they  are 
taken  under  different  environmental  conditions,  with  differ¬ 
ent  cameras,  or  at  different  times. 

3.  MODELING  SKIN  IN  THERMAL  IR 

For  the  purposes  of  segmentation,  we  classify  pixels  in  in¬ 
door  scenes  as  belonging  to  one  of  three  classes:  exposed 
skin,  covered  skin  (by  clothing  or  hair),  and  background 
(everything  else). 

Our  goal  is  a  probabilistic  model  that  can  help  segment 
these  three  regions.  This  takes  the  form: 

P(c\r)  =  P(ci\r),  (3) 

where  c  e  C  =  {cs,cc,cb}  =  (ci,  02,03}  are  the  three 
classes  of  interest.  Of  course,  in  general  we  could  classify 
into  n  categories  ci,  C2, . . . ,  cn.  The  images  we  are  seg¬ 
menting  come  from  a  calibrated  image,  R ,  and  r  is  the  radi¬ 
ance  coming  from  the  pixel  in  R  under  consideration. 

We  use  a  time-dependent  model  for  the  scene’s  flux  prob¬ 
ability  density  of  the  form 

y>*  =  1,  (4) 

i=  1  i= 1 


of  each  class  in  the  training  sequence  may  not  be  representa¬ 
tive  of  that  in  the  test  data.  However,  given  an  image  frame 
Rf  at  time  t  >  0,  we  can  estimate  the  optimal  priors  in 
the  maximum  likelihood  sense.  At  this  point  it  should  be¬ 
come  clear  why  we  choose  to  segment  into  three  classes. 
Although  we  are  interested  only  in  skin  and  background  for 
the  purposes  of  detection  and  tracking,  modeling  the  back¬ 
ground  as  two  components  is  useful.  Since  our  prior  model 
of  background  cannot  take  into  account  all  possibilities,  we 
break  it  into  the  two  most  prominent  components,  those  be¬ 
ing  covered  skin  and  room  temperature  office  objects.  Since 
we  do  not  know  the  relative  frequencies  of  these  classes 
ahead  of  time,  we  adjust  them  along  with  the  prior  for  the 
skin  class  at  this  point.  This  is  done  by  maximizing  the  log¬ 
arithmic  likelihood  of  Rl  as  a  function  of  the  priors,  given 
as 

r  n  1 


log  L=  ^2  log 

reR 1 


(5) 


We  can  compute  the  corresponding  partial  derivatives  as 


Slog L  _  Pj(r ) 


Note  that  the  maximization  of  Equation  5  should  be  per¬ 
formed  for  7rt  =  (7rf,  7T2, . . . ,  7r^)  in  the  convex  constraint 
set 

n 

K  =  (x  G  Rn  I  =  1,  Xj  >  0},  (7) 

where  subscripts  denote  coordinate  components,  and  n  —  3 
in  our  case.  Geometrically,  K  is  the  face  of  the  unit  sim¬ 
plex  in  Mn  which  does  not  lie  on  a  coordinate  plane  (see 
Figure  3). 

In  order  to  maximize  Equation  5  numerically,  we  use 
the  method  of  iterated  projections  [9].  Let  :  Mn  — ► 
K  denote  the  projection  map  onto  the  set  K  (guaranteed 
to  exist  by  the  convexity  of  K).  The  method  of  iterated 
projections  proceeds  according  to  the  following  scheme 

7T  t’k+i  =  $K(7r*’fe  +  e(Vlogi)(7rt’fe)),  (8) 


with  e  >  0.  This  scheme  converges  to  a  local  maximum 
7 rt,0°  of  Equation  5.  It  remains  to  show  the  construction 
of  For  x  G  Mn,  the  projection  onto  the  hyperplane 
passing  through  the  standard  basis  vectors  e^i  =  1, . . . ,  n, 
is  given  by 


where  P\  is  the  class-conditional  density  for  class  q,  and 
7 Ti  are  the  class  priors  at  time  t.  The  initial  class-conditional 
densities  P f  are  obtained  from  training  data,  by  hand  seg¬ 
menting  the  desired  classes  in  a  video  sequence,  and  esti¬ 
mating  the  corresponding  densities.  Note  that  we  assume 
no  parametric  form  for  P/,  for  any  t  >  0. 

While  the  densities  P -°  can  be  estimated  from  training 
data,  the  class  priors  nj  cannot,  since  the  relative  abundance 


(^if(x))i  =  Xi  +  ^(1  —  '22xj)-  (9) 

}=  1 

The  projection  from  the  hyperplane  onto  the  constraint  set 
K  is  given  by 


(^(x))i 


if  Xi  <  0 
otherwise 


(10) 


Fig.  3.  Constraint  set  for  the  maximum  likelihood  estima¬ 
tion  of  mixture  parameters. 


where  *S+  is  the  sum  of  the  non-negative  coordinates  of  x, 
and  C+  is  the  number  of  strictly  positive  coordinates.  Now 
the  desired  projection  can  be  written  as  4>x  = 

In  order  to  obtain  a  good  estimate  of  the  global  maxi¬ 
mum  likelihood,  we  use  the  scheme  in  Equation  8  within  an 
iterative  random  restarts  loop.  Multiple  runs  of  the  con¬ 
strained  local  optimization  are  performed  starting  at  ran¬ 
dom  perturbations  of  the  best  local  optimum  in  order  to  lo¬ 
cate  a  (hopefully)  global  optimum.  As  with  any  non-convex 
global  problem,  we  have  no  guarantee  of  finding  the  global 
maximum,  however,  in  experiments  with  synthetic  data  the 
method  described  above  is  able  to  find  close-to-optimal  so¬ 
lutions. 

Once  the  optimal  class  priors  have  been  determined,  the 
mixture  (Equation  4)  constitutes  our  estimate  of  the  proba¬ 
bility  density  for  the  thermal  flux  in  the  current  video  frame. 
Recall  that  our  intermediate  goal  is  to  obtain  the  posterior 
probabilities  for  each  of  our  constituent  classes  (skin,  cov¬ 
ered  skin  and  background).  These  can  be  computed  from 
the  information  at  hand  via  Bayes  rule,  which  in  this  con¬ 
text  takes  the  form 


P‘(Cj|r) 


7 Tjpj(r) 

Eiii  <mry 


(ID 


Equation  1 1  allows  us  to  compute,  for  any  given  pixel 
in  the  current  frame,  the  likelihood  of  belonging  to  each 
class,  and  thus  constitutes  a  soft  segmentation  of  the  image. 
This  process  can  be  repeated  for  each  incoming  frame,  to 
obtain  updated  estimates  based  on  the  latest  data.  While 


Fig.  5.  Left:  Training  densities  for  skin,  clothing  and  back¬ 
ground.  Right:  Comparison  of  the  histogram  for  the  LWIR 
frame  in  Figure  4  with  maximum  likelihood  estimate  based 
on  the  training  densities  on  the  left. 


we  have  adorned  our  class  conditional  densities  P\  with  a 
superscript  denoting  time,  we  have  not  yet  explained  how 
those  densities  vary  as  t  changes.  The  time-adaptation  of 
the  class  conditional  densities  is  discussed  in  Section  5,  after 
the  tracking  procedure  has  been  introduced. 

A  result  of  applying  the  posterior  density  estimation  de¬ 
scribed  above  to  a  LWIR/MWIR  image  is  shown  in  Fig¬ 
ure  4,  where  we  can  see  probability  images  for  exposed 
skin,  covered  skin  and  background.  Some  intuition  into  the 
maximum  likelihood  method  may  be  gained  from  Figure  5, 
where  we  see  the  training  densities  (building  blocks),  to¬ 
gether  with  a  comparison  of  the  actual  frame  histogram  with 
the  ML  estimate.  Note  how  the  class  priors  are  correctly  es¬ 
timated,  yielding  a  very  good  semi-parametric  estimate  of 
the  actual  density. 

We  should  mention  that  it  is  possible  to  incorporate  the 
class  priors  into  the  variable  representing  the  tracking  state 
(see  Section  4),  but  doing  so  increases  the  dimensionality  of 


Fig.  4.  Probabilities  of  skin,  clothing  or  hair  covered  skin  and  background  computed  via  Equation  1 1 . 


the  state  space.  This  degrades  our  ability  to  estimate  prob¬ 
ability  densities,  and  sharply  increases  the  computational 
cost  of  such  estimation.  We  avoid  this  by  separating  the 
estimation  of  class  priors  from  the  tracking  stage  itself.  Fur¬ 
thermore,  the  class  priors  are  independent  of  the  position  of 
the  tracked  object  within  the  scene,  and  of  its  motion  pa¬ 
rameters,  so  by  decoupling  their  estimation  from  that  of  the 
tracking  state  variable  we  incur  no  accuracy  cost. 


4.  TRACKING  SKIN  IN  THERMAL  IR 

Let  us  review  the  basics  of  Bayesian  tracking.  Once  a  soft 
segmentation  has  been  computed,  we  can  use  it  to  track 
faces  in  the  scene.  We  model  faces  simply  as  arbitrarily 
oriented  ellipses,  with  variable  sizes  and  positions.  Let  X 
denote  the  state  space  of  all  such  ellipses.  The  task  of  track¬ 
ing  consists  of  selecting  an  element  of  X  for  each  video 
frame  at  time  t.  More  generally,  one  normally  wishes  to  es¬ 
timate  a  probability  density  on  the  state  space,  encoding  the 
likelihood  that  the  tracked  object  is  in  a  given  configuration. 
Once  this  density  has  been  computed,  a  number  of  estima¬ 
tors  can  be  applied  in  order  to  recover  the  single  state  which 
we  believe  best  corresponds  to  the  object’s  parameters.  In 
our  case,  we  use  the  MAP  estimator,  which  simply  selects 
the  state  with  highest  likelihood. 

Reasoning  on  the  relative  likelihood  of  different  states 
in  A  is  based  on  a  series  of  independent  observations  up 
to  time  t ,  Zt  =  {Zi,  Z2, . . . ,  Zt},  corresponding  to  con¬ 
secutive  frames  of  video.  These  observations  are  related  to 
states  in  A  via  an  observation  model  Q{ Zt|xt).  Further¬ 
more,  successive  states  are  related  via  a  dynamical  model 
x*  =  /(xt_i),  which  expresses  the  physics  that  govern  the 
system.  For  instance,  it  might  relate  velocity  to  position,  or 
it  might  describe  how  uncertainty  increases  over  time.  Al¬ 
though  this  rule  may  not  be  known  explicitly,  one  assumes 
that  there  is  some  model  of  it  that  is  known,  and  is  expressed 
as  <3(xt|xt_i). 

Armed  with  these  models,  and  the  observations  Zt,  it 


is  possible  to  determine  the  desired  probability 

density  on  the  state  space  A',  by  an  application  of  Bayes’ 
rule  combined  with  the  dynamics  equation  to  obtain: 


Qt{*t\Zt) 


Qt(Zt) 


/ 

J  Xf 


<7/(x,|x,  i)Qt  i(x,  „ii2(  .jV/x,  ,J 


(12) 


If  the  densities  in  Equation  12  are  not  Gaussian,  and/or 
the  dynamics  model  is  non-linear,  the  numerical  computa¬ 
tion  of  Equation  12  is  complex  and  time  consuming,  as  there 
is  no  closed-form  method  for  evaluating  the  left-hand- side. 
As  is  now  standard  in  the  tracking  literature,  we  use  a  parti¬ 
cle  filter  method  for  estimating  the  posterior  densities,  as  in 
the  CONDENSATION  [10,  11]  algorithm. 

To  evaluate  Q( Zt|xt),  where  Zt  is  an  observation  (i.e. 
an  image),  we  look  at  a  series  of  fixed-length  segments  pierc¬ 
ing  the  boundary  of  the  ellipse.  In  order  to  minimize  compu¬ 
tational  time,  the  segments  lie  on  a  fixed  grid  on  the  image 
plane,  as  shown  in  Figure  6.  We  wish  to  compute  the  like¬ 
lihood  that  a  conjectured  model  (an  ellipse  in  this  case)  ex¬ 
plains  the  observed  data.  That  is  the  likelihood  that  a  given 
ellipse  coincides  with  the  boundary  of  a  face.  If  the  seg¬ 
ments  piercing  the  ellipse’s  boundary  are  not  too  close  to 
each  other,  this  can  be  well  approximated  by  the  product 
of  the  likelihoods  that  each  single  segment  intersects  the 
boundary  of  a  face.  Thus  it  remains  to  compute  that  like¬ 
lihood  for  a  single  segment. 

This  quantity  is  computed  under  the  hypothesis  that  the 
conjectured  ellipse  does  indeed  correspond  with  a  face  bound¬ 
ary.  Under  that  hypothesis,  we  can  assume  that  radiances  on 
the  ‘interior’  half  of  the  segment  are  drawn  independently 
and  identically  distributed  with  respect  to  the  class  condi¬ 
tional  density  for  skin,  as  estimated  in  Section  3.  On  the 
‘exterior’  half  of  the  segment,  we  can  assume  that  the  ra¬ 
diances  have  been  drawn  likewise,  but  with  respect  to  the 
complementary  density  (the  probability  of  not- skin).  In  or¬ 
der  to  cope  with  the  uncertainty  as  to  the  appearance  of  pix- 


Fig.  6.  Example  of  segments  on  a  fixed  grid  used  to  compute 
the  likelihood  of  the  conjectured  model  given  the  data. 

els  immediately  nearby  the  boundary  and  the  non-elliptical 
nature  of  the  human  face,  we  may  ignore  radiances  com¬ 
ing  from  pixels  on  the  segment  immediately  adjacent  to  the 
boundary  of  the  ellipse.  Decomposing  the  ith  segment  as  a 
union  of  interior,  exterior  and  middle,  we  write  the  corre¬ 
sponding  likelihood  as 


SLi  = 

n  w 

n  pns(r) 

_ interior 

_ exterior 

and  therefore 

N 

Q(z\x)  =  Y[SLi,  (14) 

1  =  1 

where  N  is  the  number  of  segments  piercing  the  ellipse 
x  G  £ .  In  order  to  obtain  a  more  robust  estimate  of  the  rel¬ 
ative  likelihoods  for  different  hypothesized  states,  instead 
of  taking  the  product  of  the  individual  pixel  posterior  likeli¬ 
hoods,  we  average  the  likelihoods  for  groups  of  three  pixels 
along  the  piercing  segment,  and  multiply  those  values  as  in 
Equation  13. 

Since  faces  are  positioned  above  the  neck,  there  is  nor¬ 
mally  no  direct  transition  from  skin  to  background  or  cloth¬ 
ing  at  the  bottom  boundary  of  the  face.  Computing  Equa¬ 
tion  13  including  segments  piercing  the  lower  portion  of  the 
ellipse  would  bias  our  likelihood  estimate,  since  we  do  not 
expect  to  see  skin/not- skin  transitions  in  that  area.  There¬ 
fore,  we  ignore  an  elliptical  arc  spanning  an  angle  of  45° 
about  the  bottom  of  the  ellipse. 

We  use  a  simple  dynamical  model  of  the  form 

Qt(xt |xt_i)  =  K(x t-1  -  vt),  (15) 


where  K  is  a  kernel  composed  of  two  Gaussian  distribu¬ 
tions,  as  pictured  in  Figure  7.  One  of  the  Gaussians  has 
small  variance,  and  is  intended  to  thoroughly  explore  the 
area  of  state  space  where  the  maximum  likelihood  is  ex¬ 
pected  to  occur.  The  second,  higher  variance  component  is 
meant  to  explore  the  surrounding  area  in  case  the  estimated 
position  for  the  maximum  likelihood  state  is  incorrect.  The 
vt  term  in  Equation  15  is  an  estimate  of  the  velocity  at  which 
the  tracked  object  is  moving  in  the  plane,  and  thus  all  but 
two  of  its  five  coordinates  are  zero.  We  obtain  an  estimate 
of  vt  using  a  recursive  filter  on  instantaneous  estimates  vu 
as  follows 

vt  =  E[Qt-i(xt-<t\Zt-i)]  ~  E[Qt-i(xt-2\Zt-2)]tt6) 
vt  =  0i>t  +  (  l-/3)d*-x,  (17) 

where  (3  is  the  learning  rate,  which  we  set  at  0.25.  These 
simple  dynamics  give  us  a  rather  robust  estimate  of  velocity, 
under  the  assumption  of  generally  uniform  motion.  The  es¬ 
timate  degenerates  somewhat  when  the  tracked  object  speeds 
up  or  changes  direction,  but  it  recovers  quickly,  as  the  high 
variance  component  of  Equation  15  searches  for  the  object 
in  an  extended  neighborhood  of  its  expected  location. 

Much  like  for  the  class  priors  estimated  in  Section  3,  it  is 
possible  (maybe  even  customary  [10,  11,  12,  13])  to  include 
the  velocity  parameters  into  the  state  variable  for  the  tracked 
object.  However,  once  again,  this  increases  the  dimension¬ 
ality  of  the  state  space,  thus  rendering  density  estimation 
less  accurate  and  more  computationally  expensive.  For  that 
reason,  we  separate  the  velocity  estimation  from  that  of  the 
other  state  variables.  Our  velocity  estimate  is  not  meant 
to  be  very  precise,  just  a  rough  guess  to  aid  the  Bayesian 
tracker  by  exploring  more  thoroughly  the  space  about  the 
expected  location  of  the  target.  Experimentally,  we  see  that 
the  estimate  is  accurate  enough  to  achieve  this  goal. 

5.  TRACKING-DRIVEN  ADAPTATION 

Section  3  introduced  the  ideas  involved  in  estimating  the 
posterior  flux  densities  for  each  material  class,  based  on 
hand-labeled  training  data,  maximum  likelihood  estimation 
of  class  priors  and  Bayes  Rule.  However,  it  did  not  deal 
with  the  issue  of  adaptation  of  the  class  conditional  densi¬ 
ties,  moving  them  away  from  the  training  data  to  better  fit 
the  observed  radiances.  That  is  the  subject  of  the  current 
section. 

Let  us  suppose  for  a  moment  that  at  a  given  time  step, 
we  know  the  tracker  has  successfully  located  the  skin  re¬ 
gion.  Under  this  assumption,  we  have  access  to  a  sample  of 
radiances  from  the  skin  class  (those  pixels  in  the  interior  of 
the  region)  which  is  more  representative  of  the  appearance 
of  skin  in  this  particular  video  sequence,  and  for  this  par¬ 
ticular  individual,  than  the  training  data  used  to  estimate  the 
skin  density.  Therefore  we  should  be  able  to  update  our  skin 


Fig.  7.  Mixture  of  Gaussians  kernel  used  in  the  dynamics 
model. 

density  estimate  to  better  match  the  current  conditions.  Of 
course,  in  reality  we  do  not  know  whether  the  tracker  has 
correctly  located  the  skin  area,  and  thus  cannot  directly  up¬ 
date  our  estimates.  We  now  propose  a  confidence  criterion 
which  will  allow  us  to  carry  through  a  similar  procedure 
without  resorting  to  unobtainable  ground  truth  information. 

The  result  of  our  CONDENSATION  tracker  developed 
in  Section  4  is  a  probability  density  function  Q  (or  an  es¬ 
timate  thereof)  on  a  configuration  space  representing  the 
possible  states  of  the  tracked  target  in  the  scene.  This  den¬ 
sity  represents  the  likelihood  that  the  tracked  object  is  in  a 
certain  configuration  given  the  current  video  frame  and  the 
history  of  the  object’s  states.  In  Section  4  we  used  the  MAP 
estimator  to  arrive  at  the  position  of  the  object  based  on  this 
density.  We  propose  to  use  the  uncertainty  in  Q  as  an  in¬ 
dication  of  how  likely  the  tracker  is  to  have  succeeded  in 
locating  the  target.  The  uncertainty  in  Q  can  be  measured 
by  its  entropy: 

H(Q)  =  J^QlogQ,  (18) 

where  the  logarithm  is  understood  to  be  zero  wherever  Q 
vanishes.  It  can  be  shown  that  for  continuous  spaces,  the 
distribution  with  lowest  entropy  is  the  delta  distribution,  and 
the  highest  entropy  for  a  fixed  variance  is  attained  by  the 
normal  distribution  with  that  variance.  For  discrete  proba¬ 
bility  spaces,  the  lowest  entropy  is  achieved  by  a  distribu¬ 
tion  whose  mass  is  concentrated  at  a  single  point,  and  the 
highest  by  the  uniform  distribution. 

Estimating  the  uncertainty  of  a  high-dimensional  non- 
parametric  density  is  not  a  computationally  straightforward 
task.  This  is  a  direct  consequence  of  Bellman’s  curse  of 
dimensionality.  [14]  While  the  entropy  as  defined  in  Equa¬ 
tion  18  gives  an  exact  measure  of  the  uncertainty  of  the  un¬ 


derlying  distribution,  we  can  construct  a  related  measure 
which  can  be  estimated  much  more  readily.  Let  the  un¬ 
derlying  space  of  the  probability  density  Q  be  Mn,  and  for 
1  <  i  <  n,  denote  by  Qi  the  marginal  density  of  Q  with 
respect  to  the  ith  coordinate: 

Qi(xi)  =  J  Q(x  i,  •  •  • ,  xn)dx  i  ...dxi...  dxn,  (19) 

where  dxi  denotes  ommission  of  the  ith  coordinate  differ¬ 
ential.  The  chain  rule  for  entropies  states  that 

-t  n  i  n 

H(Q )  =  -  53 +  o  E  H(Qi\Qi)>  (2°) 

Tl.  Z  .  . 

1=1  2,J  =  1 

where  the  second  term  is  the  sum  of  the  relative  entropies 
between  the  respective  marginals.  On  the  other  hand,  we 
also  have  that 

n  1  n 

H(Q)  =J2H(Qi)  -  2  E  HQi,Qj),  (21) 

i= 1  i,j  =  l 

where  the  second  term  is  the  sum  of  the  mutual  informations 
between  the  respective  marginals.  It  follows  from  Equa¬ 
tions  20  and  21  that 

-i  n  n 

-YJH(Qi)<H(Q)<YJH{Qi),  (22) 

nU  ti 

with  equality  attained  on  the  left  when  all  marginals  of  Q  are 
equal,  and  on  the  right  when  Q  is  a  product  of  its  marginal 
densities. 

As  a  consequence  of  the  previous  discussion,  for  a  fixed 
dimension,  the  average  entropy  of  the  marginals  of  Q ,  de¬ 
noted  H(Q ),  is  uniformly  equivalent  to  the  entropy  of  Q 
itself,  and  thus  is  equally  useful  to  us  as  a  measure  of  the 
uncertainty  in  Q.  In  contrast  with  H(Q ),  however,  H(Q ) 
can  be  easily  estimated  even  for  non-parametric  densities, 
since  it  relies  only  on  one-dimensional  computations. 

We  can  learn  the  relative  entropy  values  for  ‘good’  and 
‘bad’  tracking  by  observing  the  tracker’s  behavior  on  train¬ 
ing  sequences.  From  this  we  obtain  upper  and  lower  bounds 
for  the  mean  marginal  entropy  of  the  posterior  state-space 
density  Qt ,  denoted  Hm[n  and  ^max-  Using  these  learned 
bounds,  we  define  the  adaptation  rate  by 

at  =  mm(max(-= - — — - ,  0),  1).  (23) 

ttmax  -t^min 

Using  the  adaptation  rate,  we  modify  our  class-conditional 
density  for  skin,  as  follows 

P{+%  =  atPt  +  (l-at)Zt,  (24) 

where  is  a  non-parametric  estimate  of  the  greyvalue  dis¬ 
tribution  strictly  inside  the  MAP  tracking  state. 


Fig.  8.  Entropy  as  a  function  of  time  for  an  intentionally 
distracted  CONDENSATION  tracker.  Note  the  rise  and  fall 
in  entropy  as  the  tracker  loses  and  re-acquires  the  target. 


Fig.  9.  Estimates  of  the  probability  of  skin  before  and  after 
adaptation. 

Figure  9  shows  the  effect  of  adaptively  learning  the  prob¬ 
ability  density  for  skin  based  on  entropy-weighted  tracking 
results.  These  images  show  the  (posterior)  probability-of- 
skin  images  without  and  with  adaptation.  Note  how  some 
background  objects  such  as  the  warm  chassis  of  a  computer 
monitor  look  somewhat  like  skin  on  the  image  on  the  left. 
Additionally  the  pixels  on  the  wall  behind  the  subject  have  a 
small,  but  non-negligible  probability  of  skin.  With  tracking- 
driven  adaptation  we  obtain  the  image  on  the  right,  where 
essentially  all  distractors  have  been  ruled  out,  and  the  only 
pixels  with  significant  probability  of  being  skin  are  those  on 
the  subject’s  face. 

6.  EXPERIMENTAL  RESULTS 

We  performed  tracking  experiments  on  several  LWIR  and 
MWIR  indoor  image  sequences.  Training  and  testing  data 
was  acquired  and  calibrated  according  to  the  procedures 
outlined  in  Section  2.  Using  a  typical  hand-segmented  frame 
from  the  training  sequence,  we  create  initial  histogram  es¬ 
timates  for  the  class-conditional  densities  of  skin,  clothing 
or  hair-covered  skin,  and  background.  These,  together  with 
a  test  sequence,  are  the  initial  inputs  to  our  tracker.  Typical 


results  for  the  soft  segmentation  portion  of  the  algorithm 
can  be  seen  in  Figure  4. 

Detection  of  the  face  and  tracker  initialization  are  done 
automatically  by  seeding  the  first  image  frame  with  a  uni¬ 
form  distribution  of  particles  representing  ellipses  at  differ¬ 
ent  positions,  sizes,  and  orientations.  After  the  first  frame, 
the  estimated  densities  evolve  according  to  the  Bayesian 
tracking  procedure  in  Section  4.  Figure  10  shows  the  initial 
estimate  of  the  face  location,  obtained  without  operator  in¬ 
tervention,  and  four  typical  frames  from  an  80  frame  MWIR 
sequence.  Note  that  the  initial  estimate  is  rather  good,  but  it 
greatly  improves  as  the  tracking  process  evolves. 

We  collected  ground-truth  positions  of  the  subject’s  head 
for  the  sequence  in  Figure  10,  by  hand-placing  the  ellipse 
best  fitting  the  face  in  each  frame.  Figure  1 1  shows  a  com¬ 
parison  of  ground-truth  x—  and  //-coordinates  of  the  center 
of  the  ellipse  for  each  of  the  80  frames  in  the  sequence.  The 
mean  absolute  (L1)  error  for  the  ^-coordinate  estimate  is 
1.6  pixels,  with  a  standard  deviation  of  1.37  pixels,  while 
for  the  //-coordinate  we  obtain  a  mean  error  of  2.9  pixels, 
with  a  standard  deviation  of  2.0  pixels.  We  should  note  that 
estimating  the  vertical  position  of  the  best  bounding  ellipse 
is  a  harder  task  for  both  the  human  operator  and  the  auto¬ 
mated  tracker,  since  the  hairline  and  chin  are  not  as  clear  in 
the  images  as  the  sides  of  the  face. 

We  can  validate  the  dynamical  model  in  Equation  15  by 
comparing  the  velocity  estimated  using  the  recursive  filter 
(Equation  17)  with  that  obtained  by  differencing  the  ground- 
truth  position  data.  This  comparison  is  shown  in  Figure  1 1 , 
in  which  we  see  that  after  a  brief  period  of  uncertainty,  the 
estimate  is  a  good  approximation  to  the  observed  velocities. 
Recall  that  this  estimate  is  obtained  without  adding  velocity 
variables  to  the  state  space,  and  is  therefore  not  the  same 
as  that  which  we  might  obtain  by  differencing  the  position 
estimates.  In  fact,  our  velocity  estimate  is  predictive,  and  it 
does  not  lag  one  frame  behind  the  position  estimate. 

Lastly,  we  should  mention  the  computational  cost  in¬ 
volved  in  tracking  using  this  method.  Up  to  the  present  time, 
we  have  made  no  effort  to  optimize  the  tracker  for  real-time 
performance.  As  a  result,  it  currently  takes  approximately 
5  seconds  to  process  each  frame,  on  dual  PHI  lGhz  com¬ 
puter  with  512Mb  of  memory.  This  falls  clearly  short  of  the 
real-time  standard,  but  we  believe  that  judicious  optimiza¬ 
tion  can  yield  processing  times  of  over  5Hz  on  currently 
existing  off-the-shelf  hardware. 

7.  CONCLUSIONS 

We  introduced  a  methodology  for  tracking  human  faces  in 
calibrated  thermal  infrared  imagery.  The  use  of  calibrated 
imagery  allows  us  to  use  training  data  acquired  before  the 
tracking  stage,  with  the  assurance  that  its  thermal  flux  val¬ 
ues  will  be  comparable  to  those  measured  during  tracking. 


Fig.  10.  Initial  estimate  of  face  location  (left),  and  tracking  results  for  frames  7,  18,  38  and  78  of  an  80  frame  MWIR 
sequence. 


Frame 


Fig.  11.  Top:  Estimates  and  ground-truth  positions  for 
MWIR  sequence  in  Figure  10.  Bottom:  Estimates  and 
ground-truth  velocities  for  MWIR  sequence  in  Figure  10 


This  calibration  step  is  a  critical  difference  between  track¬ 
ing  in  the  visible  and  thermal  domains.  When  working  with 
visible  imagery,  one  has  to  worry  about  the  effects  of  light¬ 
ing  and  color  constancy  (or  lack  thereof)  on  the  acquired 
data.  In  sharp  contrast  with  that  situation,  calibrated  ther¬ 
mal  imagery  gives  us  lighting  invariant  data,  very  suitable 
for  robust  tracking. 

Our  method  consists  of  several  portions  of  independent 
interest.  Firstly  we  outline  a  scheme  for  modeling  and  seg¬ 
menting  skin,  covered  skin  and  background  in  thermal  im¬ 
agery  based  on  training  data  plus  a  maximum  likelihood  es¬ 
timation  step.  We  model  each  frame’s  histogram  as  a  mix¬ 
ture  of  non-parametric  densities,  and  estimate  the  class  pri¬ 
ors  in  order  to  best  approximate  the  actual  flux  density. 

Tracking  proceeds  along  the  lines  of  the  CONDENSA¬ 
TION  algorithm,  with  a  likelihood  model  based  on  the  pos¬ 
terior  densities  for  each  of  the  three  segmented  classes.  In 
section  6  we  show  results  of  applying  the  tracker  to  stan¬ 
dard  indoor  MWIR  scenes.  Furthermore,  we  compare  our 
results  to  hand-measured  ground  truth  positions  and  veloc¬ 
ities.  This  comparison  shows  that  our  tracker  is  very  accu¬ 
rate,  usually  within  2  pixels  of  the  hand-labeled  face  cen¬ 
troid. 

We  go  beyond  simple  tracking,  by  allowing  the  track¬ 
ing  results  to  feed  back  into  the  segmentation  stage.  This 
is  accomplished  using  the  average  marginal  entropy  of  the 
posterior  state  density  as  a  measure  of  tracking  accuracy. 
Based  on  this  measure,  we  can  adapt  our  class-conditional 
densities,  thus  yielding  a  better  segmentation  (and  therefore 
tracking)  that  can  be  obtained  with  pre-computed  training 
data  alone. 

A  number  of  extensions  of  this  work  should  be  consid¬ 
ered,  including  handling  multiple  subjects  and  the  resulting 
occlusion  problem.  The  most  interesting  area  for  future  re¬ 
finement  of  this  method  lies  within  the  feedback  loop  be¬ 
tween  the  tracking  and  segmentation  stages.  While  we  pro¬ 
vide  an  effective  means  of  adapting  the  class-conditional 
densities  based  on  an  entropy  weighting  criterion,  there  is 
certainly  room  for  further  experimentation  in  this  area.  As 
we  mentioned  above,  real-time  implementation  of  the  algo¬ 
rithm  remains  a  challenge,  but  we  believe  it  possible  on  off- 
the-shelf  hardware,  after  careful  optimization.  The  promis¬ 
ing  preliminary  results  in  this  paper,  together  with  the  illu- 


mination  invariance  properties  afforded  by  thermal  infrared 
imagery,  make  this  an  attractive  avenue  of  investigation. 
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